Background: Writing diagnostic reports is a cumbersome and challenging task. Expert knowledge and cross-methodological thinking are mandatory. Integration of Large Language Models (LLMs) could address unmet needs. As a proof of principle, we used report generation in cytomorphology of peripheral blood (PB) and bone marrow (BM) smears.

Aim: To introduce LLMs to support cytomorphologists, pathologists and technicians in routine report writing.

Methods: From 1/2024 to 4/2025, 56,676 cytomorphology cases were assessed in our routine laboratory workflow, each including at least May-Grünwald-Giemsa (MGG) staining of PB and/or BM. Based on defined quality criteria, we selected a dataset of 16,261 cases and trained Llama 3.1 8B Instruct as a base model for report generation, using supervised fine-tuning (SFT) and a validation set of 578 samples. For each case, the LLM was provided with the intended diagnosis and tabular data from the cytomorphological assessment and prompted to generate the full report text. During training, model predictions were compared with the corresponding human expert reports using token-level cross-entropy loss, which quantifies how closely the generated text matches the expected report. This loss was used to iteratively and automatically adjust the model's internal parameters; through these feedback loops, we progressively improved the LLM's ability to generate reports matching the quality of human-authored reports.
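A minimal sketch of this SFT objective is shown below: the prompt combines the intended diagnosis with the tabular findings, and the token-level cross-entropy loss is computed only over the report tokens. It assumes a Hugging Face transformers/PyTorch setup; the field names ("diagnosis", "cytology_table", "report") and hyperparameters are illustrative, not our exact pipeline.

```python
# Hedged sketch of supervised fine-tuning (SFT) with token-level cross-entropy.
# Assumptions: Hugging Face transformers + PyTorch; `training_cases` is a list of
# dicts with illustrative keys "diagnosis", "cytology_table" and "report".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def build_example(case):
    # Prompt = intended diagnosis + tabular findings; target = expert report text.
    prompt = (
        f"Diagnosis: {case['diagnosis']}\n"
        f"Cytomorphology findings:\n{case['cytology_table']}\n"
        "Write the diagnostic report:\n"
    )
    enc = tokenizer(prompt + case["report"], return_tensors="pt",
                    truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:, :prompt_len] = -100  # loss is computed only on the report tokens
    return enc["input_ids"], enc["attention_mask"], labels

for case in training_cases:
    input_ids, attention_mask, labels = build_example(case)
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()        # token-level cross-entropy loss
    optimizer.step()           # iteratively adjust the model's internal parameters
    optimizer.zero_grad()
```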

To ensure accuracy and reliability of the Llama reports, we integrated Claude Sonnet 4.0 as a “judge model” in a second step. Using predefined criteria, it assesses the predictions and identifies mistakes made by the Llama model or by human experts, comparing each report to the tabular raw data from the cytomorphological assessment. Categorizing every finding as critical, major or minor, it returns a score ranging from 0 to 10. We evaluated Llama's performance on a roughly balanced test set of 241 cases covering the ten most common diagnoses. The judge ratings were validated by human experts, and their corrections were used to continuously improve the judge model.
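A hedged sketch of such a judge call is given below: the judge receives the tabular raw data and the report under review and returns categorized findings plus a 0–10 score. The prompt wording, JSON schema and model identifier are illustrative assumptions, not our exact criteria.

```python
# Hedged sketch of the "judge model" step using the Anthropic Python SDK.
# The scoring rubric and the JSON schema below are illustrative assumptions.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_INSTRUCTIONS = (
    "You are a quality judge for cytomorphology reports. Compare the report to the "
    "tabular raw data, list every discrepancy and label it critical, major or minor. "
    'Return JSON: {"findings": [{"severity": "...", "description": "..."}], "score": 0-10}.'
)

def judge_report(raw_table: str, report_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # check your account for the current model alias
        max_tokens=1024,
        system=JUDGE_INSTRUCTIONS,
        messages=[{
            "role": "user",
            "content": f"Raw data:\n{raw_table}\n\nReport under review:\n{report_text}",
        }],
    )
    return json.loads(response.content[0].text)

# e.g. verdict = judge_report(case["cytology_table"], llama_report)
#      verdict["score"], [f["severity"] for f in verdict["findings"]]
```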

The dual-model setup allows for a robust validation process, ensuring high reliability of the generated reports. It also allowed us to compare the performance of different LLMs on the given task.

Lastly, we employed the SBERT score (ranging from 0 to 1) to measure semantic similarity between two pieces of text by comparing meaning rather than exact wording.
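In practice, this can be computed as the cosine similarity of sentence-transformer embeddings, as in the minimal sketch below; the embedding model name is an illustrative choice, not necessarily the one used in our pipeline.

```python
# Minimal sketch of the SBERT-based semantic similarity score (0 to 1).
# The embedding model name is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def sbert_score(generated: str, reference: str) -> float:
    # Embed both texts and compare meaning via cosine similarity of the embeddings.
    embeddings = sbert.encode([generated, reference], normalize_embeddings=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

# e.g. sbert_score(llama_report, expert_report): values near 1 indicate strong alignment
```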

An integral part of our monitoring system is a feedback feature, enabling human experts to provide immediate feedback on the model's performance. We systematically incorporate corrected cases from routine diagnostics into a dynamic training dataset, ensuring that the model is trained on diverse and up-to-date information.
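A minimal sketch of this feedback loop is given below: corrected reports are appended to a growing training store and picked up at the next fine-tuning run. The file name and record fields are illustrative assumptions, not our production schema.

```python
# Hedged sketch of the feedback feature: expert corrections are appended to a
# dynamic training dataset (here a JSONL file) for the next fine-tuning round.
import json
from datetime import datetime, timezone

FEEDBACK_STORE = "dynamic_training_set.jsonl"  # illustrative path

def record_feedback(case_id: str, model_report: str, corrected_report: str, severity: str) -> None:
    record = {
        "case_id": case_id,
        "model_report": model_report,
        "expert_report": corrected_report,   # becomes the new SFT target text
        "severity": severity,                # e.g. "minor wording" or "critical diagnosis"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(FEEDBACK_STORE, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```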

Results: Since the first integration into our routine workflow (8/2024), more than 27,900 cases have been processed using continuously improved versions of our LLM. In parallel, we have seen increasing use and acceptance by human experts (today, 86.6% of reports are first generated by AI and then reviewed by a human expert).

The current fine-tuned Llama model version achieved an average judge score of 8.22 on the test set, compared with 6.97 for human expert reports. The average SBERT score over all predicted texts was 0.84, indicating strong semantic alignment between the LLM output and expert reports.

Analysis of 3,898 cases processed in 7/2025 showed that 79.2% of the LLM-suggested texts needed no adjustments (32.3%) or only minor adjustments (46.9%, defined as a deviation of less than 20% of characters). This is supported by an SBERT score of ≥ 0.9 in 84.3% of cases. The median processing time for human experts was reduced by 51.0% in a random subset of the validation cohort.
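One plausible way to quantify this character-level deviation, using only the Python standard library, is sketched below; it is an illustrative assumption rather than the exact metric used in our evaluation.

```python
# Hedged sketch: character-level deviation between the LLM suggestion and the
# final signed report, based on difflib's similarity ratio (an assumption).
from difflib import SequenceMatcher

def char_deviation(suggested: str, final: str) -> float:
    """Fraction of characters changed: 0.0 = identical, 1.0 = completely rewritten."""
    return 1.0 - SequenceMatcher(None, suggested, final).ratio()

# Reports with char_deviation(...) < 0.20 would count as needing only minor adjustments.
```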

Since the introduction of the feedback tool (5/2025), only 2.3% of cases have been flagged. Most flagged reports concerned minor wording issues (43.0%), whereas critical issues (e.g. regarding the diagnosis) occurred in only 6.7% of flagged cases.

Conclusion: We successfully integrated a fine-tuned Llama 3.1 8B model into our workflow for cytomorphological report writing. This has led to more accurate report texts and shorter turnaround times. We strongly believe that advanced AI models, combined with rigorous monitoring mechanisms and continuous feedback loops, can be used in medical diagnostics to support human experts. This ensures not only the generation of accurate and clinically useful reports, but also the ongoing enhancement of the model's capabilities.
